Abstract

This study addresses stock price prediction. We present a machine learning and deep learning approach
that draws on the emotional patterns of investors as extracted from Twitter data. We join two separate
datasets, Twitter sentiment and the associated stock prices, apply several learning algorithms, and evaluate
a method that classifies each user's sentiment as positive or negative. This makes it possible to account for
investor behavior within a highly volatile stock market environment. One of the main difficulties developers
face is capturing stock price fluctuations over time. This calls for reliable evaluation metrics: precision and
recall for classification, and root mean squared error (RMSE) for regression. Using these metrics, the
performance of the predictive model can be evaluated in detail, which supports meaningful conclusions
about how well it reflects the relationship between investor sentiment and market dynamics. The evaluation
carried out to validate the proposed hypothesis shows that predictive accuracy improves when machine
learning and deep learning algorithms are paired with sentiment data from Twitter users. This combination
enhances the predictive capacity of the model and gives insight into how changes in stock prices are
affected by investor sentiment. The study also describes in detail the feature engineering and data
processing steps, which are crucial for the accuracy of the predictive model. The work has implications for
both financial analysis and engineering practice and offers a model for further research in stock prediction.
By combining sentiment analysis with stock prediction, this interdisciplinary approach advances the
understanding of investor behavior, opens opportunities for new applications at the intersection of
technology and finance, and can better inform predictive modeling research in the financial industry.
1. Introduction


The thesis consists of three stages: sentiment
analysis, stock price prediction, and machine
learning algorithms. The first stage applies sentiment
analysis to a Twitter dataset to classify investor
tweets about stocks [1]. This process requires data
pre-processing to filter out irrelevant text. The
second stage predicts stock prices using regression
techniques based on past stock prices. The final
stage uses machine learning and deep learning
algorithms to predict stock prices. All algorithms are
evaluated for accuracy, and the most accurate
algorithm is selected for future price prediction. The
thesis's primary objective is to develop an intelligent
model that provides investors with detailed
information on acquiring assets in the stock market.
The model must be robust and requires
hyperparameter tuning at the end of the training phase [2].
2. Literature Survey 

This section reviews state-of-the-art efforts to predict
stock-market prices using sentiment analysis and
deep learning on tweet text data [3]. The study by
Yichuan Xu and Vlado Keselj (IEEE, 2019) uses an
attention-based LSTM variant for prediction. Their
approach combines finance tweet sentiment with
stock technical indicators to gain better
performance from this modified LSTM. The
accuracy of this model is 56%, although future
improvements could raise this
figure. Mudinas, Zhang, and Levene (2019)
classified sentiments from news and tweets into
eight categories (such as fear and anger) [4]. Only a
small number of these emotions were
somewhat correlated with subsequent stock
movements, and predictions based on technical
factors alone typically yielded better
results. Khedr & Yaseen (2017)
came to a similar conclusion. To determine the
weight of each token and categorize the gathered
financial news into positive and negative sentiment
categories, they employed the TF-IDF and N-gram
methods [5]. The K-NN classifier yielded 59.18%
accuracy using sentiment attributes alone and
89.80% accuracy using sentiment and historical
stock data together. The authors did not, however,
report the accuracy attained using only
historical data [6]. Technical factors appear to be
the deciding elements in stock prediction, while
extracted sentiment attributes may even harm the
results. Financial articles from 2016-01-01 to
2020-04-01 were used by Kabbani & Usta (2022) to
forecast the daily stock trend. Given the speed at
which opinions shift, only the trends for
the current day and the following day were
predicted. Features highly correlated with the
articles' sentiment scores were selected for the final
dataset following correlation analysis [7]. Using
random forest, gradient boosting machine, and
linear regression, the model achieved an average
accuracy of 63.58%. Weng, Ahmed, and Megahed
(2017) presented a model to forecast
stock movement one day in advance. Rather than
sentiment analysis, the authors' forecasting inputs
included market data, Wikipedia traffic, Google
News counts, and a range of technical indicators.
The model only considered changes in Apple
shares between May 1, 2012, and June 1, 2015.
By combining information from multiple sources,
the model reached an accuracy of 85.8%, indicating
that additional data can improve prediction. The study by
R. Satishkumart (JETIR, 2020) presents a stock price
forecasting system that combines long
short-term memory for technical analysis with
sentiment analysis for fundamental analysis to
produce its forecasts. Sentiment analysis was
conducted using selected keywords [8]. Users
who are not familiar with the stock market may find
this system handy, and individuals with varying levels
of trading experience can use it to
forecast future stock prices. As future work, the
system is to support intra-day prediction and
improved sentiment analysis that yields an impact
factor for fundamental analysis free of sarcasm. The
study by Hatefi Ghahfarrokhi examines how social
media (SM) data may forecast Tehran Stock Exchange
(TSE) indicators, using the closing prices and
daily returns of three different stocks [9]. Three
months of StockTwits data were collected. A
combined learning-based and lexicon-based approach
was proposed to extract information from internet
forums. Because the existing Persian lexicons are
unsuitable for sentiment analysis (SA), a bespoke
sentiment lexicon was developed. After daily
sentiment indices were computed from the comments,
new predictor models using multiple regression
analysis were proposed [10]. The investigation also
took into account the number of comments and the
reliability of the commenters. The results indicate that
the characteristics of TSE stocks affect how predictable
they are. Sentiment and comment
volume may be useful for estimating the daily
return, and the trust coefficients of the three
stocks behave differently [11].

3. Methodology 

This section describes the techniques used in our
project: preprocessing, sentiment analysis, model
development, model training, and performance
measurement [12].
3.1. Preprocessing 
We first convert the tweets CSV file into a Pandas
dataframe. Since the stock market is not open on
weekends and specific holidays, such as Good
Friday and Thanksgiving, we drop these days
[13]. The following preprocessing steps are then
applied to the data using NLTK:
• Remove HTML entities.
• Substitute @mentions, URLs, etc. with
whitespace using regular expressions.
• Substitute any non-alphabetic characters with
whitespace.
• Convert all words to lowercase.
• Remove stop words.
• Reduce words to their base form using
lemmatization.
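
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name clean_tweet and the tiny STOP_WORDS set are assumptions made for the example. In practice the stop-word list would come from NLTK's stopwords corpus, and lemmatization (omitted here to keep the sketch free of external downloads) would be applied to the resulting tokens, for instance with NLTK's WordNetLemmatizer.

```python
import html
import re
from typing import List

# Tiny illustrative stop-word list (assumption); NLTK's English list is far longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "it"}

def clean_tweet(text: str) -> List[str]:
    """Return lowercase alphabetic tokens with stop words removed."""
    text = html.unescape(text)                                # decode HTML entities
    text = re.sub(r"@\w+|https?://\S+|www\.\S+", " ", text)   # drop @mentions and URLs
    text = re.sub(r"[^A-Za-z]+", " ", text)                   # non-alphabetic -> whitespace
    tokens = text.lower().split()                             # lowercase and tokenize
    return [t for t in tokens if t not in STOP_WORDS]         # remove stop words

tokens = clean_tweet("@elon $TSLA is up 5% &amp; rising! https://t.co/abc")
print(tokens)
```

The order matters: mentions and URLs are removed before the non-alphabetic substitution, otherwise fragments such as "https" and "t co abc" would survive as tokens.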


After preprocessing with the Natural Language
Toolkit, we perform sentiment analysis on
each tweet, using VADER to generate the sentiment
score. Empty entries are ignored, and each tweet is
classified as positive (value of +1) or negative
(value of 0). We then average the individual
sentiment values so that a single sentiment value is
present for every day. Next, we convert the stock
data CSV file into a Pandas dataframe and drop the
columns that are not needed. Finally, we join the
stock data dataframe and the sentiment dataframe
into a single dataframe [14].
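
As a rough sketch of the daily averaging and join described above, the following uses Pandas on a few made-up rows. The column names (date, label, open, sentiment) and all values are hypothetical and chosen only for illustration; the real CSV files may be laid out differently.

```python
import pandas as pd

# Hypothetical per-tweet labels: 1 = positive, 0 = negative.
tweets = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-06"],
    "label": [1, 0, 1, 1],
})

# Hypothetical daily opening prices (trading days only).
stocks = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02", "2021-03-03"],
    "open": [100.0, 102.0, 101.0],
})

# One sentiment value per day: the mean of that day's tweet labels.
daily_sentiment = (
    tweets.groupby("date", as_index=False)["label"].mean()
    .rename(columns={"label": "sentiment"})
)

# The inner join keeps only days present in both frames, so the Saturday
# tweet (2021-03-06) and the tweet-less trading day (2021-03-03) drop out.
merged = stocks.merge(daily_sentiment, on="date", how="inner")
print(merged)
```

With an inner join, non-trading days are removed as a side effect of the merge. Whether trading days without tweets should be dropped or filled with a neutral value is a design choice the text does not specify.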
3.2. Sentiment Analysis Model: Vader 
The sentiment values are calculated from a dataset
of daily tweets, selected by keywords such as
#TSLA. The tweets are collected with the Python
library twint, and their sentiment is scored with
VADER, a lexicon- and rule-based sentiment
analysis tool designed for social media text.
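
To make the positive/negative labelling concrete, here is a minimal sketch of how a VADER compound score could be mapped to the binary labels used in this work. The compound score lies in [-1, 1]; with the vaderSentiment package it is typically obtained from SentimentIntensityAnalyzer().polarity_scores(text)["compound"]. The function name label_from_compound and the 0.05 cut-off are assumptions for illustration (0.05 is the conventional VADER threshold for "positive"); the text does not state which threshold was used.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> int:
    """Map a VADER compound score in [-1, 1] to 1 (positive) or 0 (negative).

    Assumption: neutral scores below the threshold are treated as negative,
    because the scheme described in the text has only two classes.
    """
    return 1 if compound >= threshold else 0

# Hypothetical compound scores for three tweets.
scores = [0.64, -0.30, 0.0]
labels = [label_from_compound(s) for s in scores]
print(labels)
```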
3.3. Model Development 

Random Forest is an ensemble method that fits
multiple decision trees on different bootstrap
samples of the dataset and averages their outputs to
improve predictive accuracy. It has a large number
of parameters that can be optimized, including the
number of trees and the number of features to consider.
XGBoost is a gradient boosting method that
combines weak learners into a strong learner, with
each new tree trained to correct the errors made by
the trees before it. LSTM, or Long Short-Term
Memory, is a recurrent neural network architecture
that uses memory blocks in the recurrent hidden
layer to store the temporal state of the network and
gates to control information flow. Bidirectional long
short-term memory (Bi-LSTM) and stacked
bidirectional long short-term memory (Stacked
Bi-LSTM) process the input sequence in both
directions, preserving both past and future
information. These models differ in the features they
capture and in their computational cost.
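
For reference, the gating mechanism can be written out using the standard LSTM formulation (the common textbook form, which is not necessarily the exact variant trained in this work). Here x_t is the input at time step t, h_t the hidden state, c_t the cell state, W, U and b are learned parameters, \sigma is the logistic sigmoid and \odot is the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

A bidirectional layer runs one such cell forward and a second one backward over the sequence and concatenates the two hidden states, h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]. A stacked variant feeds this sequence of concatenated states into a further bidirectional layer.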

3.4. Model Training 
Once a sentiment score is available for every day,
the stock data CSV file is converted into a
Pandas dataframe and the columns that
are not needed are dropped. The stock data
dataframe and the sentiment dataframe are then
joined into a single dataframe, which is passed to
the model for training.

3.5. Performance Measurement 
Performance is measured using the root mean
squared error (RMSE) and the R2 score. The results
indicate that incorporating the emotional sentiment
of users improves the overall performance of the
system [15].
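
Both metrics are simple to compute by hand. The sketch below uses only the Python standard library; the function names and the sample values are illustrative assumptions. RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import math
from typing import Sequence

def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical opening prices and predictions.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(rmse(actual, predicted), r2_score(actual, predicted))
```

A lower RMSE is better and is expressed in the same units as the price, while an R2 closer to 1 means the model explains more of the variance in the opening price.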

3.6. Sample Dataset Used 
We use two datasets for this project.
Stock Price Data: Daily opening prices over the last
5 years for 5 different companies (Amazon, Apple,
Facebook, Google, Microsoft, Netflix, Tesla, etc.).
This data was collected from Yahoo! Finance.
Tweet Data of Stocks: Twitter data containing
approximately 30-50 tweets per day, collected
using the Twitter API.
The tweets include the username and the text of the
tweet. The tweets are split by date, since our
project is focused on predicting daily
opening stock prices. The system architecture is
shown in Figure 1.
Figure 1 System Architecture


4. Result and Analysis 
This section reports the accuracy obtained
by the machine learning and deep learning
algorithms. We use the Apple tweet data and stock
data for the results. Among all models, Stacked
Bi-LSTM gives the highest accuracy and the Random
Forest regression model gives the lowest
accuracy. For training and testing, we split
the dataset into two parts: the first 90% is used for
training and the remaining 10% for testing.
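
A minimal sketch of such a split is shown below, assuming the rows are already sorted by date. The function name chronological_split and the sample data are hypothetical. Taking the first 90% rather than a shuffled sample keeps the test period strictly after the training period, so the model is never evaluated on days that precede its training data.

```python
from typing import List, Sequence, Tuple

def chronological_split(rows: Sequence, train_pct: int = 90) -> Tuple[List, List]:
    """Split date-ordered rows into (train, test) without shuffling."""
    cut = len(rows) * train_pct // 100   # integer arithmetic avoids float rounding
    return list(rows[:cut]), list(rows[cut:])

# Hypothetical dataset of 20 trading days, identified by day index.
days = list(range(1, 21))
train, test = chronological_split(days)
print(len(train), len(test))
```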

Table 1 Accuracy for APPLE 


Table 1 reports the root mean squared
error (RMSE) and R2 score for each model.
Conclusion 

Through this project, we have shown how the
sentiment analysis of tweets can be used to give
investors meaningful insights regarding the stocks
of different companies. Using a Python-based
library, twint, to collect the data, the project
employed several machine learning techniques such
as Random Forest, XGBoost, and Linear
Regression to obtain detailed results.
Using LSTM, Bi-LSTM, and Stacked Bi-LSTM
enabled the deep learning model to process
the sequence data in both directions and turn it
into straightforward, readable
predictions. Our project is intended to provide
investors with refined information so that they can
make investment decisions with more confidence.
Overall, the project demonstrates that combining
sentiment analysis with advanced
machine learning techniques offers
valuable insights for the investment community.